Identifying missing dictionary entries with frequency-conserving context models

نویسندگان

  • Jake Ryland Williams
  • Eric M. Clark
  • James P. Bagrow
  • Christopher M. Danforth
  • Peter Sheridan Dodds
چکیده

In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods for organization fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment appropriately, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary, an extensive, online, collaborative, and open-source dictionary that contains over 100000 phrasal definitions, we develop highly effective filters for the identification of meaningful, missing phrase entries. With our predictions we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough, lexical extraction technique and expanding our knowledge of the defined English lexicon of phrases.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

German Verb Patterns and Their Implementation in an Electronic Dictionary

Here we describe an electronic lexical resource for German and the structure of its lexicon entries, notably the structure of verbal single-word and multi-word entries. The verb as the center of the sentence structure, as held by dependency models, is also a basic principle of the JAKOB narrative analysis application, for which the dictionary is the background. Different linguistic layers are c...

متن کامل

Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus

We propose a novel context heterogeneity similarity measure between words and their translations in helping to compile bilingual lexicon entries from a non-parallel English-Chinese corpus. Current algorithms for bilingual lexicon compilation rely on occurrence frequencies, length or positional statistics derived from parallel texts. There is little correlation between such statistics of a word ...

متن کامل

Generating acceptable Arabic Core Vocabularies and Symbols for AAC users

This paper discusses the development of an Arabic Symbol Dictionary for Augmentative and Alternative Communication (AAC) users, their families, carers, therapists and teachers as well as those who may benefit from the use of symbols to enhance literacy skills. With a requirement for a bi-lingual dictionary, a vocabulary list analyzer has been developed to evaluate similarities and differences i...

متن کامل

An Investigation into Bilingual Dictionary Use: Do the Frequency of Use and Type of Dictionary Make a Difference in L2 Writing Performance?

Bilingual dictionary use in L2 writing test performance has recently been the subject of debate. Opinions differ according to how the trait is understood and whether the system favors the process-oriented or product-oriented views towards the assessment and writing skill. Given the need for more empirical support, this study is aimed at investigating the availability of bilingual dictionary use...

متن کامل

Using corpus information to improve MT quality

The paper presents activities to improve the quality of a rule-based MT system, using corpus information. If focuses on the area of dictionary, and shows how, and to which extent, corpus-based information can improve the system quality in the different areas of dictionary development. It deals with two main sources of errors: missing entries / translations, and wrong selections of one out of se...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Physical review. E, Statistical, nonlinear, and soft matter physics

دوره 92 4  شماره 

صفحات  -

تاریخ انتشار 2015